ABSTRACT

Command and control (C2) and decision making are seriously challenged by information overload and
uncertainty. To make sense of the flood of information, military organizations have to create new ways
of processing sensor and intelligence information, and of providing the results to commanders.
Initiated in 2004 at Defence Research and Development Canada (DRDC), the SACOT1 knowledge
engineering research project is currently investigating, developing and validating innovative natural
language processing (NLP) approaches as scientific means to capture knowledge objects contained in
domain-specific electronic texts and turn them rapidly into broad domain ontologies to be used in
third-party applications. Ontologies are key elements required to enable the next generation of
decision support and knowledge exploitation systems with new semantic capabilities. The major
impediments to the classic development of ontologies are that it is a time- and budget-consuming
operation and that it is largely dependent on the limitations of Subject Matter Experts (SMEs).
Exhaustive elicitation of the knowledge objects of a domain requires the application of NLP extraction
techniques over textual data. This paper illustrates how recent advances in NLP techniques are
implemented in the SACOT framework to automate the elicitation of knowledge objects from
unstructured texts and to support SMEs efficiently in ontology engineering tasks.

1.  INTRODUCTION 

Command and control (C2) and decision making are seriously challenged by information overload and
uncertainty. To make sense of the flood of information, military organizations have to create new ways
of processing sensor and intelligence information, and of providing the results to commanders who
must make timely operational decisions. Research in the field of Information and
Knowledge  Management  (IKM)  consists  in  investigating  and  advancing  knowledge  creation  and 
discovery  techniques  through  which  information  is  collected  and  processed  to  support  situation 
analysis  and  gain  sufficient  situational  awareness  to  be  able  to  project  possible  future  courses  of  action 
or  trends  with  confidence.  In  2001,  the  Canadian  Forces  Future  Army  Capabilities  report  [DND,  2001] 
pointed  out  that  “without  some  fundamental  change,  current  army  ISR  will  be  incapable  of  providing 
the  degree  of  knowledge  that  will  be  required  by  future  commanders.”  Therefore  “all  relevant  data, 
information  and  knowledge  must  be  available  at  all  levels,  but  managed  in  a  way  that  produces  a 
current,  rapid  and  coherent  understanding  of  the  battlespace,  while  at  the  same  time  allowing  the 
various  levels  of  command  to  process  the  relevant  material  for  their  specific  purposes.” 


1  SACOT:  Semi-Automatic  Construction  of  Ontologies  from  Texts 
2  Intelligence, Surveillance, Reconnaissance (ISR)
Ontologies are key elements required to enable decision support systems, knowledge exploitation and
information  retrieval  systems  with  new  semantic  capabilities.  Since  Gruber  [Gruber,  1993],  the 
scientific  community  defines  an  ontology  as  a  formal,  explicit  specification  of  a  shared 
conceptualization.  When  domain  knowledge  is  represented  in  a  declarative  formalism,  such  as  in  an 
ontology,  the  set  of  objects  that  can  be  represented  is  called  the  universe  of  discourse.  This  set  of 
objects,  and  the  formalized  relationships  among  them,  are  reflected  in  the  representational  vocabulary 
[id.].  Domain  ontologies  provide  vocabularies  about  the  concepts  within  a  domain  and  their 
relationships,  about  activities  that  take  place  in  that  domain  and  about  theories  and  elementary 
principles governing that domain [Corcho et al., 2003]. This paper illustrates how natural language
processing techniques can support and automate domain ontology engineering.

2.  ONTOLOGIES  FOR  INTELLIGENT  COMMAND  AND  CONTROL  SYSTEMS 

Research from the military community clearly indicates that, as in industry, there are many needs and
potential uses for domain ontologies across military areas. With the development and maturity of the
Semantic Web [Davies et al., 2003], automated ontology engineering will provide the cornerstone
technology for sharing a common understanding of a domain among humans, agents and machines.
Ontologies for command and control systems will be instrumental in establishing a
Common  Operational  Picture  (COP)  among  units  by  making  domain  representations,  situation  analysis 
and  assumptions  more  explicit.  Agents  assisting  commanders  with  the  command  and  control  task  will 
have  the  ability  to  “interpret”  data  and  know  its  meaning  and  value  based  on  domain  ontologies. 
According to [Bowman et al., 2001], in order for Artificial Intelligence (AI) to become truly useful in
high-level military applications it is necessary to identify, document, and integrate into automated
systems the human knowledge that senior military professionals use to solve high-level problems. That
paper [ibid.] illustrates this statement by the development and use of a course of action ontology.
While it is generally accepted that the next generation of command and control systems will integrate
and use ontologies, existing technologies and methodologies for rapidly building such ontologies
remain very limited.

3.  NATURAL  LANGUAGE  PROCESSING  FOR  ONTOLOGY  ENGINEERING 

Since knowledge objects of a given domain are expressed and conveyed in texts using domain-specific
terminology, it is reasonable to think that mining and extracting this terminology will lead us to a
certain domain representation model. The problem is how to achieve high-quality automated extraction
of those knowledge objects in order to build reliable ontologies with them.

Initiated in 2004 at Defence Research and Development Canada (DRDC), the SACOT4 knowledge
engineering  research  project  is  currently  investigating,  developing  and  validating  innovative  natural 
language  processing  (NLP)  approaches  as  scientific  means  to  capture  knowledge  objects  contained  in 
open  source  electronic  texts  and  turn  them  rapidly  into  broad  domain  ontologies  to  be  used  in  third- 
party  applications. 


3  See for instance [Bourry-Brisset, 2000; Chance & Hagenston, 2003; Gauvin et al., 2004; Gouin et al., 2003; Dorion & Bourry-Brisset,
2004; Bowman et al., 2001]

4  Semi-Automatic  Construction  of  Ontologies  from  Texts  (SACOT) 
As two of the core components of domain ontologies are
concepts  and  relations  among  concepts,  the  SACOT  project 
encompasses  several  NLP  research  areas.  Identification  and 
extraction  of  concepts  contained  in  texts  is  supported  by 
innovative  terminology  extraction  techniques.  Semantic 
relations  existing  among  concepts  are  identified  and  extracted 
using  other  sets  of  natural  language  processing  techniques. 
Using  the  two  core  components  extracted  from  the  electronic 
texts  (concepts  and  semantic  relations)  and  other  reference 
material,  draft  ontologies  are  automatically  compiled  and 
generated. 

Knowledge engineers can use this automated ontology-engineering
environment as a knowledge framework in order
to  validate  and  enhance  the  draft  ontologies.  While  validating 
the  content  of  the  draft  ontologies,  knowledge  engineers  will 
teach  the  system  about  which  among  all  potential  semantic 
relations  identified  in  texts  are  most  valuable  and  which  are 
not  relevant. 

Essentially,  domain  ontologies  are  made  of  sets  of  concepts  (classes)  and  the  relationships  or 
properties  that  can  be  expressed  among  those  concepts.  Figure  1  shows  a  partial  draft  ontology  of 
infectious  diseases  generated  from  a  local  semantic  network  obtained  from  parsing  a  sample  input  text. 

Since the building blocks of domain ontologies are concepts (e.g. anthrax, Bacillus anthracis,
spore-forming bacterium, in Fig. 1) and relations among concepts (e.g. CAUSES, IS_A), three NLP
techniques  are  being  investigated  in  the  SACOT  framework  to  capture  those  elements:  terminology 
extraction  techniques,  named  entities  extraction  techniques  and  semantic  relations  extraction 
techniques.  Those  three  extraction  techniques  will  be  presented  in  sections  5.2  to  5.4. 

4.  ONTOLOGY  ENGINEERING  METHODOLOGIES 

Most published ontology engineering methods5 require interviews with Subject Matter Experts
(SMEs) to elicit the knowledge objects of a domain. In all approaches relying heavily on SMEs, the
extent of the domain represented in the ontology depends on the expertise and the degree of
“expressiveness” of the available SMEs. This limitation might lead to unacceptably poor
performance of ontology-based information systems. Typically, a domain terminology can contain from
a few hundred terms (e.g. the Professional Golfers’ Association (PGA) Glossary of Golf) to several
hundred thousand terms (e.g. up to 160,000 terms in a medical dictionary). It is unlikely that any SME
interview will ever elicit the whole terminology of a domain. We need to turn to more exhaustive and
objective data sources. The major impediments to the classic development of ontologies are that it is a
time- and budget-consuming operation and that it is largely dependent on the SMEs’ own knowledge
limitations. Exhaustive elicitation of the knowledge objects of a domain requires the application of NLP
extraction techniques over textual data.


Fig.  1:  Partial  Draft  Ontology  of  Infectious 
Diseases 


5  [Corcho et al. 2003; Gomez-Perez 1999; Gomez-Perez et al. 2004; Sure 2003; Uschold and Gruninger 1996; Gruninger and Fox 1995]
5.  SACOT  ONTOLOGY  ENGINEERING  FRAMEWORK

5.1  The  Overall  Process 

As mentioned, in traditional ontology engineering methodologies, SMEs are interviewed at the
beginning of the process to elicit knowledge objects. The methodology developed for the SACOT
framework also includes early interviews with SMEs to identify domain-specific material (electronic
texts, electronic dictionaries, if any, etc.). The SME also plays the role of a knowledge engineer,
being presented with draft ontologies for validation, and contributes to the maintenance of the ontology.
Figure 2 below illustrates the overall ontology engineering process in SACOT.

1.  Sources Identification. The first step consists in gathering and formatting all domain-specific
information sources. SMEs are consulted to provide knowledge engineers with reference material that
represents consensual knowledge sources among the SME community.

2.  Extraction Processes. Domain-specific electronic texts are processed in three different extraction
modules to identify knowledge objects. The next sections (5.2 - 5.4) describe those extraction
processes.

3.  Draft Ontologies Generation. Knowledge objects extracted during the previous phase are then
compiled into a draft ontology. Reference material such as core ontologies, lexical ontologies (e.g.
WordNet), and domain-specific electronic dictionaries or thesauri, if any, is used to guide the draft
ontology generation process.

4.  Draft Ontologies Validation. During this phase, SMEs are required to validate the content of the
draft ontologies. An agent monitors the validation work. Rules are derived from the human-based
validation work so that the system can learn from the validation process and prune future draft
ontologies according to stored validation rules.

5.  Ontology Maintenance. Finally, knowledge engineers use ontology management tools to manage
versioning of the domain ontology which, in turn, is reused as reference material during the next
extraction cycle.
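
The validation step (step 4) can be illustrated with a minimal sketch. The paper does not specify the form of the stored validation rules, so everything below is an assumption made for illustration: the class name ValidationMemory, its methods, and the choice to represent a rule as either a rejected triple or a rejected relation type are all hypothetical.

```python
from dataclasses import dataclass, field

# A candidate knowledge object: (term, relation, term). Hypothetical layout.
Triple = tuple[str, str, str]


@dataclass
class ValidationMemory:
    """Stores SME decisions so that later draft ontologies can be pruned."""

    rejected_triples: set[Triple] = field(default_factory=set)
    rejected_relations: set[str] = field(default_factory=set)

    def record(self, triple: Triple, accepted: bool) -> None:
        """Remember a single SME decision on one candidate triple."""
        if not accepted:
            self.rejected_triples.add(triple)

    def reject_relation(self, relation: str) -> None:
        """Remember that a whole relation type is not relevant."""
        self.rejected_relations.add(relation)

    def prune(self, draft: list[Triple]) -> list[Triple]:
        """Drop every candidate that a stored rule already rejects."""
        return [
            t for t in draft
            if t not in self.rejected_triples and t[1] not in self.rejected_relations
        ]


# Toy decisions, invented for illustration.
memory = ValidationMemory()
memory.record(("anthrax", "IS_A", "acute infectious disease"), accepted=True)
memory.record(("department", "IS_A", "weapon"), accepted=False)

next_draft = [
    ("anthrax", "IS_A", "acute infectious disease"),
    ("department", "IS_A", "weapon"),
]
pruned = memory.prune(next_draft)
```

In this sketch a rejection made once is applied automatically to every later draft, which is the behaviour the validation step describes; a real system would presumably generalize beyond exact matches.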

5.2  Terminology  Extraction 

Terms are linguistic representations of concepts. Basically, terminology extraction is the process by
which raw terminological units corresponding to specific morpho-syntactic patterns are extracted from
electronic texts6. Those extracted terminological units are considered candidate terms and need
further validation to determine whether they belong to a specific domain or are simply general
vocabulary.

Nowadays, one of the most challenging problems in terminology extraction is the automation of the
validation process by which raw candidate terms can be automatically assigned to specific domain
terminology. Terminology extraction tools and techniques tend to generate huge amounts of candidate
terms requiring human validation. To be fully effective, validation of candidate terms needs to be
automated. Otherwise, the original information overload issue will simply be replaced by a candidate
term overload.


Fig. 2: SACOT’s Ontology Engineering Process


6  Details on recent terminology extraction techniques can be found in [Jacquemin 2001] and [Bourigault et al. 2001].


Recent advances in computational terminology suggest the use of contrastive datasets and statistics as
a means to validate candidate terms [Drouin 2003, 2004]. Using this approach, candidate terms are
extracted from two different domain-specific corpora. The resulting lists of candidates, together with
their respective frequency ratios, are then compared. If the same candidate can be found in both lists
with similar frequency ratios, the probability that it is not a domain-specific term is very high. When a
candidate term can be found in both lists, a statistical comparison of the frequencies observed in the
two corpora is computed in order to elicit domain-specific terminology.


Frequency  Term        Score
6619       terrorist   101.99
4209       terrorism    92.80
4587       nuclear      83.01
3018       biological   78.67
2520       weapon       68.01
1895       Iraq         61.35
2107       attack       57.79
1885       domestic     55.80
1200       department   47.57
1125       al           47.18
2266       military     46.97
1527       September    46.59
1048       Iraqi        46.23

Table 1: Sample List of Terrorism Domain Candidate Terms


Table  1  shows  a  partial  list  of  terms  extracted  using 
contrastive  corpora.  Scores  quantify  the  observed  deviation 
from  a  normal  distribution.  These  deviations  indicate  that, 
considering  the  two  corpora  used  to  establish  comparison, 
terms  are  statistically  more  related  to  the  terrorism-related 
corpus  than  to  the  other  corpus  used.  This  is  quite  obvious 
with  terms  such  as  nuclear,  biological,  weapon. 

SACOT’s automatic terminology extraction and validation processes exploit contrastive datasets and
implement the approach proposed by [Drouin 2003, 2004].
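
The idea of contrastive validation can be sketched as follows. This is not the statistic of [Drouin 2003, 2004] nor necessarily the one behind the scores in Table 1; it substitutes a generic two-proportion z statistic, chosen here only because it also measures a deviation under a normal approximation. The function name and the toy corpora are hypothetical, and both corpora are assumed to be non-empty.

```python
import math
from collections import Counter


def specificity_scores(analysis_tokens, reference_tokens):
    """Score each term of the analysis corpus against a reference corpus.

    The score is a two-proportion z statistic: a large positive value means
    the term is relatively more frequent in the analysis corpus.
    """
    analysis_freq = Counter(analysis_tokens)
    reference_freq = Counter(reference_tokens)
    n_analysis, n_reference = len(analysis_tokens), len(reference_tokens)
    scores = {}
    for term, count in analysis_freq.items():
        # Relative frequency of the term when both corpora are pooled.
        pooled = (count + reference_freq[term]) / (n_analysis + n_reference)
        std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_analysis + 1 / n_reference))
        if std_err == 0:  # degenerate case: both corpora contain only this term
            continue
        scores[term] = (count / n_analysis - reference_freq[term] / n_reference) / std_err
    return scores


# Toy corpora, invented for illustration (not the corpora used for Table 1).
analysis = ["weapon"] * 5 + ["the"] * 5
reference = ["the"] * 5 + ["report"] * 5
scores = specificity_scores(analysis, reference)
```

Here "the" has the same relative frequency in both corpora and scores zero, while "weapon" appears only in the analysis corpus and receives a clearly positive score, so a threshold on the score would keep it as a domain-specific candidate.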

5.3  Named  Entities  Extraction 

Named Entities (NE) represent another important set of knowledge objects to be captured in texts. The
following table introduces the standard named entity categories that were defined for the Message
Understanding Conference (MUC-7) [Chinchor, 1997].


Entity         Description
ORGANIZATION   Named corporate, governmental, or other organizational entity
PERSON         Named person or family
LOCATION       Name of politically or geographically defined location (cities, provinces,
               countries, international regions, bodies of water, mountains, etc.)
DATE           Complete or partial date expression
TIME           Complete or partial expression of time of day
MONEY          Monetary expression
PERCENT        Percentage

Table 2: Standard Named Entities [Chinchor, 1997]


Named entities can also include street addresses, Uniform Resource Locators (URLs), email addresses,
symbols, and measures. Extending the concept of named entity itself, named entity categories can be
considered as classes and the corresponding retrieved information elements as instances or individual
representations of those concepts. For instance, each different street address found in a text represents
a different instance of the named entity category called STREET ADDRESS. From there, named entities
themselves can be formalized using an ontology.
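
The reading of categories as classes and retrieved elements as instances can be made concrete with a small sketch. The regular expressions below are deliberately simplistic assumptions written for this illustration, not the patterns used in SACOT or defined by MUC-7, and the sample text (including the address info@example.org) is invented.

```python
import re
from collections import defaultdict

# One hypothetical pattern per named entity category (class).
PATTERNS = {
    "URL": re.compile(r"https?://[^\s)]+"),
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PERCENT": re.compile(r"\d+(?:\.\d+)?\s?%"),
}


def extract_instances(text):
    """Return a mapping from category (class) to the set of its instances."""
    instances = defaultdict(set)
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            instances[category].add(match.group())
    return dict(instances)


sample = "Contact info@example.org or see http://gate.ac.uk for details; 12% of documents."
instances = extract_instances(sample)
```

Each key of the result plays the role of a class and each string in its set is one instance, which is the structure an ontology of named entities would formalize.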
When it comes to domain-specific named entity extraction, the standard categories proposed at the
MUC-7 conference appear to be too generic. In the SACOT framework, the NE extraction module
exploits the GATE7 open source software. New named entity annotation schemas have been defined
and new grammar rules have been written and tested at DRDC Valcartier to handle morpho-syntactic
patterns specific to the terrorism and weapons of mass destruction (WMD) domains. New named
entity classes such as Terrorism_Weapon have been defined as well. The two following figures
show a list of ontology classes (Fig. 3) and how their corresponding named entity annotation schema
is used to retrieve terrorism-related information in unstructured texts (Fig. 4).

[Figure 3: ontology editor view listing the classes Terrorism_Tactic, Terrorism_Target,
Terrorism_Weapon and Terrorist_Group]

Fig. 3: Partial Terrorism-related Named Entities Ontology


[Figure 4: GATE document editor view of a sample text on infrastructure protection and emergency
preparedness, annotated with the types Terrorism_Country, Terrorism_Tactic, Terrorism_Target,
Terrorism_Weapon and Terrorist_Group]

Fig. 4: SACOT’s Terrorism-related Named Entities Automatic Identification Using GATE


7  GATE:  General  Architecture  for  Text  Engineering  (http://gate.ac.uk) 
5.3.1  From  Instances  to  Named  Entity  Patterns

Building new named entity grammar rules requires analysis of lexical patterns. Since the early 1990s8,
collocation techniques applied to textual corpora have been widely used in the natural language
processing community to identify recurrent co-occurring lexical items. Analysis of collocations can
provide essential information about term variations such as car bombing, car-bombing, and
carbombing. Frequent collocations can lead to the discovery of different instances of the same class
(e.g. biological weapon, chemical weapon, nuclear weapon, radiological weapon, etc.). These are all
different instances of the class Terrorism_Weapon. Instead of enumerating all instances belonging to
this class in the ontology, simple named entity grammar rules such as {JJ + "weapon | weapons"}9
will easily capture them. This illustrates how analysis of collocations can be used to create new
grammar rules from specific morpho-syntactic patterns that will capture instances of corresponding
named entity categories. As illustrated in the two following figures (Fig. 5, 6), identification of
recurring patterns is based on analysis of textual data10.
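
Both steps described above, counting left collocates and applying an adjective-plus-head rule, can be sketched in a few lines. In SACOT the rules are written for GATE; the Python below is only a stand-in for illustration. The function names are hypothetical, the input to the rule is assumed to be already part-of-speech tagged with JJ marking adjectives, and the sample tokens are invented.

```python
from collections import Counter

HEADS = {"weapon", "weapons"}


def left_collocates(tokens):
    """Count the word found immediately to the left of 'weapon(s)'."""
    counts = Counter()
    for previous, current in zip(tokens, tokens[1:]):
        if current.lower() in HEADS:
            counts[previous.lower()] += 1
    return counts


def match_jj_weapon(tagged_tokens):
    """Apply the rule {JJ + "weapon | weapons"} to (token, tag) pairs."""
    hits = []
    for (word, tag), (next_word, _) in zip(tagged_tokens, tagged_tokens[1:]):
        if tag == "JJ" and next_word.lower() in HEADS:
            hits.append(f"{word} {next_word}")
    return hits


# Toy inputs, invented for illustration.
tokens = "biological weapons and chemical weapons or a biological weapon".split()
collocates = left_collocates(tokens)

tagged = [("a", "DT"), ("biological", "JJ"), ("weapon", "NN"),
          ("of", "IN"), ("the", "DT"), ("weapons", "NNS")]
matches = match_jj_weapon(tagged)
```

The collocate counts suggest which adjectives recur before the head word, and the rule then captures every such instance without listing them one by one; the determiner in "the weapons" is correctly ignored because its tag is not JJ.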


Rank  Left collocate   Frequency
1     [...]            [...]
2     [...]            [...]
3     [...]            [...]
4     chemical         [...]
5     their            89
6     [...]            [...]
7     CBRN             69
8     [...]            57
9     [...]            50
10    the              32
11    such             22
12    conventional     21
13    NBC              18
14    radiological     17
15    unconventional   17
16    these            16
17    for              13
18    frequency        13
19    to               13
20    CB               12
21    automatic        11
22    terrorist        10
23    Against          9

Fig. 6: Most Frequent Left Collocates for Word “Weapon(s)”


 United States dropped atomic  weapons on the Japanese cities of Hir
 ey had a role in the attack.  Weapons of Mass Destruction (WMD) Ter
1-Owhali and Mohamed attended  weapons and explosive training at a
 ssessed unlicensed automatic  weapons and silencers. WASHINGTON, D
     Chemical bombs Automatic  weapon fire 0 00 0
y terrorist with an automatic  weapon or one committed to a s
 nee 1058 ELN Bomb, automatic  weapons 1 1 1990 Chile: U.S. emba
  hern Pakistani 11 Automatic  weapons 1 0 2000 Pakistan: Jaish-
  Government troops Automatic  weapons >100 ? 1991 Iraq: After t
 the complex firing automatic  weapons and throwing grenades
   ned by Pan Am517 Automatic  weapons gunfire 0 0 1977 Uganda:
um nitrate, over 70 automatic  weapons, and 200 blasting capsl556
 bombs, ammunition, automatic  weapons, grenades, and various explos
 bombs, ammunition, automatic  weapons, grenades, and various explos
 measures have been taken; b.  Weapons of Mass Destruction. Although
, chemical or bacteriological  weapons, assuming that they have any
 st groups, as well as bases,  weapons, and protection to the Mujahe
 rcraft. Iraq provided bases,  weapons, and protection to the MEK, a
           IGENCE IS THE BEST  WEAPON AGAINST INTERNATION
Good Intelligence is the Best  Weapon Against International
and nine attempts to use bio-  weapons by Aum that should have been
 cing an effective biological  weapon are not insurmountable, they a
 tack, even with a biological  weapon. We can strengthen existing ca
 tack, even with a biological  weapon. We can strengthen existing ca
 e top drill for a biological  weapon." "Much of the district are i
 ned to reduce the biological  weapons threat. Security will be incr
 o acquire and use biological  weapons on a mass scale face a major
 like chemical and biological  weapons that once produced in the lab
 with chemical and biological  weapons or materials, using low-tech
  existence of its biological  weapons program, Aum scientists seeme
  the prototypical biological  weapons agent - it is relatively easy
 olved in chemical/biological  weapons (CBW) incidents: charismatic
 created extensive biological  weapons programs including work on an
 ? With respect to biological  weapons, which pathogens deserve prio
  treaty governing biological  weapons. Other mechanisms exist, such
 a Lederberg, ed., Biological  Weapons: Limiting the Threat, BCSIA S
 rts. Brad (ed.), /Biological  : Weapons of the Future? /pp.

Fig. 5: Recurring Left Collocates for Word “Weapon(s)”


5.4  Semantic  Relations  Extraction 

Once terms and named entities have been extracted and properly validated, the next challenge consists
in identifying the different semantic relations those elements share in texts. As stated in Bourigault
et al. [2001], “it is generally admitted that texts contain several clues as to the meaning of
terminological units. These clues can be automatically or semi-automatically detected and/or extracted
to provide a better understanding of what terms mean.” Expressed by surface linguistic forms, such
clues represent explicit semantic relation markers and provide means to extract semantic networks
from texts.


8  See [Sinclair, 1991; Smadja, 1993]

9  {JJ + "weapon | weapons"} means “a string made of any token having adj as part-of-speech category and followed by one of the
tokens weapon or weapons”

10  The corpus used to generate those figures contains 861,916 tokens from open source terrorism-related documents.

Early work from Hearst [1992] focused on the automatic acquisition of hyponyms sharing a taxonomic
relationship. Hyperonymic and hyponymic relationships (IS_A) have been the most studied conceptual
structures in the scientific literature. Nevertheless, the taxonomic relationship is only one among many
other types of semantic relationships. In his work on retrieval strategies for defining contexts, Auger
[1997] identified more than 150 semantic relation markers and proposed a taxonomy of semantic
relation types (Fig. 7). More recently, Condamines and Rebeyrolle [2001] explored a number of
conceptual relationships in order to build a terminological knowledge base from a corpus of electronic
texts. Starting from previous studies by Morin [1999] and Seguela [2001] on the hyperonymic
relationship, Malaise et al. [2005, 2004] extract defining contexts from texts to build differential
ontologies. Barriere [2001] and Khoo et al. [2002] identified a wide variety of linguistic expressions
for explicitly indicating cause-and-effect relationships in texts.

The SACOT framework exploits several semantic relation markers to retrieve semantic relations
among concepts in texts. Extracted concepts and relations are associated in triplet candidates
{T1, SemRel1, T2} where {Tn} is a term and {SemReln} is a semantic relation. Figure 7 shows
a partial view of the semantic relations taxonomy used in the SACOT knowledge engineering
framework.

As an example, the following sentence:

Anthrax is an acute infectious disease caused by the spore-forming bacterium Bacillus anthracis.

is rich in semantic relation markers. The semantic relation marker IS_A suggests that Anthrax is a
kind of Acute infectious disease. This is a typical taxonomic relationship. Therefore, according to
this text portion, anthrax can be said to be a member or instance of the class Acute infectious
disease. Moreover, this Acute infectious disease itself shares a causality relationship (CAUSED_BY)
with the instance bacterium Bacillus anthracis. Those semantic relations can be represented as in
Figure 8.
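
Marker-based extraction on this example sentence can be sketched with two surface patterns. The regular expressions, the function name and the relation labels as Python strings are assumptions made for this illustration; they cover only the two markers discussed above ("is a/an" and "caused by"), whereas a real marker inventory is far larger.

```python
import re

# Two hypothetical surface markers, hard-coded for the example sentence.
IS_A = re.compile(
    r"^(?P<t1>.+?)\s+is\s+an?\s+(?P<t2>.+?)(?=\s+caused\s+by\b|[.,;]|$)",
    re.IGNORECASE,
)
CAUSED_BY = re.compile(r"\bcaused\s+by\s+(?:the\s+)?(?P<t2>[^.,;]+)", re.IGNORECASE)


def extract_triples(sentence):
    """Return candidate triplets (T1, SemRel, T2) found in one sentence."""
    triples = []
    is_a = IS_A.search(sentence)
    if is_a:
        triples.append((is_a["t1"], "IS_A", is_a["t2"]))
    cause = CAUSED_BY.search(sentence)
    if cause and is_a:
        # The causality marker attaches to the class introduced by IS_A.
        triples.append((is_a["t2"], "CAUSED_BY", cause["t2"].strip()))
    return triples


sentence = ("Anthrax is an acute infectious disease caused by the "
            "spore-forming bacterium Bacillus anthracis.")
triples = extract_triples(sentence)
```

The two resulting triplets mirror the reading given in the text: a taxonomic link from Anthrax to acute infectious disease, and a causality link from that class to the bacterium.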


5.5  Compiling  Draft  Ontologies 

In the SACOT framework, draft ontologies consist of local semantic networks integrating and
structuring all knowledge objects captured during the previous extraction processes. Those ontologies
are considered drafts because they need to be validated by SMEs. Once validated, the knowledge
objects of the draft ontology are merged into the domain-specific ontology being built. Since those
new knowledge objects are now part of the domain-specific ontology, the SACOT framework will use
them as reference material at the next iteration of the extraction processes.


Fig. 7: Semantic Relation Types (Adapted from Auger, 1997)

Fig. 8: Local Semantic Network
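
The compile-validate-merge cycle of section 5.5 can be sketched as follows. The data layout (a dictionary mapping each term to a set of outgoing relation edges) and the function names are assumptions chosen for brevity, not the representation used by SACOT, and the third candidate triplet is an invented error.

```python
from collections import defaultdict


def compile_draft(triples):
    """Group candidate triplets into a local semantic network."""
    network = defaultdict(set)
    for t1, relation, t2 in triples:
        network[t1].add((relation, t2))
    return dict(network)


def merge_validated(domain, draft, validated):
    """Merge only the SME-approved triplets of the draft into the domain ontology."""
    for t1, edges in draft.items():
        for relation, t2 in edges:
            if (t1, relation, t2) in validated:
                domain.setdefault(t1, set()).add((relation, t2))
    return domain


# Candidate triplets; the last one is a deliberately wrong candidate.
candidates = [
    ("anthrax", "IS_A", "acute infectious disease"),
    ("Bacillus anthracis", "CAUSES", "anthrax"),
    ("cattle", "IS_A", "weapon"),
]
draft = compile_draft(candidates)
validated = set(candidates[:2])  # what a hypothetical SME accepted
domain = merge_validated({}, draft, validated)
```

After the merge, the domain ontology holds only the validated knowledge objects, and passing it back in as `domain` at the next cycle corresponds to its reuse as reference material.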
In the next figure (Fig. 9), the three extraction modules of the SACOT knowledge engineering
framework  are  applied  to  a  sample  input  text  to  produce  different  validated  lists.  The  extracted  material 
is  then  linked  to  a  local  semantic  network  and,  ultimately,  validated  by  the  SME  as  being  part  of  the 
broader  domain-specific  ontology. 


Sample Input Text

May 24, 2002

Anthrax is an acute infectious disease caused by the spore-forming bacterium Bacillus anthracis.
Anthrax most commonly occurs in wild and domestic lower vertebrates (cattle, sheep, goats, camels,
antelopes, and other herbivores), but it can also occur in humans when they are exposed to infected
animals or tissue from infected animals.

Anthrax is most common in agricultural regions where it occurs in animals. These include South and
Central America, Southern and Eastern Europe, Asia, Africa, the Caribbean, and the Middle East. When
anthrax affects humans, it is usually due to an occupational exposure to infected animals or their
products. Workers who are exposed to dead animals and animal products from other countries where
anthrax is more common may become infected with B. anthracis (industrial anthrax). Anthrax in wild
livestock has occurred in the United States.


Automatic Terminology Extraction Process

Candidate Terms

anthrax
acute infectious disease
spore-forming bacterium
Bacillus anthracis
wild and domestic lower vertebrates
cattle


Automatic Named Entities Extraction Process

Candidate Named Entities

DATE: May 24 2002
GEONAME: South and Central America
GEONAME: Southern and Eastern Europe
GEONAME: Asia
GEONAME: Africa
GEONAME: Caribbean
GEONAME: Middle East
GEONAME: United States


Automatic Semantic Relations Extraction Process

Candidate Semantic Relations

anthrax IS_A acute infectious disease
Bacillus anthracis CAUSES anthrax
anthrax OCCURS_IN wild and domestic lower vertebrate
cattle IS_A wild and lower vertebrate
sheep IS_A wild and lower vertebrate


Validated Lists

anthrax
acute infectious disease
spore-forming bacterium
Bacillus anthracis
wild and domestic lower vertebrates
cattle

DATE: May 24 2002
GEONAME: South America
GEONAME: Central America
GEONAME: Southern Europe
GEONAME: Eastern Europe
GEONAME: Asia
GEONAME: Africa
GEONAME: Caribbean
[...]

anthrax IS_A acute infectious disease
Bacillus anthracis CAUSES anthrax
anthrax OCCURS_IN wild and domestic lower vertebrate
cattle IS_A wild and lower vertebrate
sheep IS_A wild and lower vertebrate


Third Party Application (e.g. Knowledge Portal)

Fig.  9:  Turning  Electronic  Texts  into  Domain  Ontologies  Using  the  SACOT  Framework 
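
In Figure 9, coordinated geographic names in the candidate list (for example "South and Central America") appear as separate entries in the validated list. The figure does not say whether this split is made by hand or automatically. As a rough sketch of how such splits could be proposed to the SME for confirmation, the hypothetical helper `split_coordinated` below handles only the four-word "X and Y Z" shape and is not taken from the SACOT framework.

```python
def split_coordinated(name):
    """Expand 'A and B Head' into ['A Head', 'B Head'].

    Only the four-word 'X and Y Z' shape is handled; any other
    name is returned unchanged as a one-element list.
    """
    words = name.split()
    if len(words) == 4 and words[1] == "and":
        first, _, second, head = words
        return [f"{first} {head}", f"{second} {head}"]
    return [name]


# Candidate names taken from Figure 9.
candidates = [
    "South and Central America",
    "Southern and Eastern Europe",
    "Asia",
    "Middle East",
]
proposed = [part for name in candidates for part in split_coordinated(name)]
print(proposed)
```

A rule this simple would wrongly split a proper name that happens to contain "and" in the same position, which is one reason to treat its output as a suggestion for validation rather than as a final result.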

6.  CONCLUSION  AND  FUTURE  WORK 

The SACOT ontology-engineering framework significantly reduces the time usually required to capture
the knowledge objects of a domain in traditional, fully human-based ontology building processes. It
provides knowledge engineers with new means to leverage the ever-increasing amount of domain-specific
electronic texts and to rapidly build broad domain ontologies for new semantic-aware applications.
Future work on the SACOT framework will investigate how learning algorithms could be used efficiently
to monitor and learn from SMEs’ validation work. It is also planned to apply the SACOT framework to
capturing and structuring knowledge objects from entirely different domains. Finally, future work will
investigate post-processing of captured knowledge objects. More specifically, semantic link analysis
will be developed and applied over the knowledge objects provided by the SACOT environment.

In the medium term, the outcomes of this new and integrated knowledge engineering framework are
expected to benefit situational awareness portals, ontology-based automatic document classification
systems, ontology-based data mining, knowledge portals, intelligent search engines and any other
application requiring semantic-level capabilities. Further integration efforts will be required to
validate those expectations.
This paper has described how recent advances in natural language processing techniques are
implemented in the SACOT framework to automate the elicitation of knowledge objects from unstructured
texts and to efficiently support Subject Matter Experts in ontology engineering tasks.